A fast divide-and-conquer algorithm for indexing human genome sequences 



Woong-Kee Loh (a), Yang-Sae Moon (b), Wookey Lee (c)

(a) Department of Multimedia, Sungkyul University
(b) Department of Computer Science, Kangwon National University
(c) Department of Industrial Engineering, Inha University



Abstract 



Since the release of human genome sequences, one of the most important
research issues has been indexing the genome sequences, and the suffix
tree is most widely adopted for that purpose. The traditional suffix tree
construction algorithms suffer from severe performance degradation due to
the memory bottleneck problem. The recent disk-based algorithms also
achieve only limited performance improvement due to random disk accesses.
Moreover, they do not fully utilize recent CPUs with multiple cores. In
this paper, we propose a fast algorithm based on the 'divide-and-conquer'
strategy for indexing the human genome sequences. Our algorithm almost
eliminates random disk accesses by accessing the disk in units of
contiguous chunks. In addition, our algorithm fully utilizes multi-core
CPUs by dividing the genome sequences into multiple partitions and then
assigning each partition to a different core for parallel processing.
Experimental results show that our algorithm outperforms the previous
fastest DIGEST algorithm by up to 3.5 times.

1. Introduction

Due to recent advances in biotechnology (BT), genome sequences of diverse
organisms, including human beings, have been collected into databases.
The Human Genome Project (HGP), which was initiated in 1990, released
human DNA sequences of approximately 3Gbp^1 in size in 2003. Since the
release, much research has been under way to harness the genome
sequences. An essential research issue is indexing large-scale genome
sequences for efficient retrieval of genome subsequences of interest.
The suffix tree is most widely adopted for indexing genome sequences.

In general, a suffix tree is created for a given string (or sequence) X
and enables efficient exact matching and approximate matching on
substrings of X. We explain the suffix tree in more detail in Section 2.

Many algorithms have been proposed for efficient construction of the
suffix tree. Ukkonen's algorithm is the most famous one; given a string
of length n, it constructs the corresponding suffix tree in O(n) time.
The algorithm implicitly assumes that n is small enough for the input
string and the output suffix tree to be loaded into main memory as a
whole. However, genome sequences could be several million or billion
times larger than the strings dealt with by traditional suffix tree
construction algorithms such as Ukkonen's algorithm. Moreover, the suffix
tree is about 10~60 times larger than the input sequence. Hence, applying
Ukkonen's algorithm to large-scale genome sequences causes severe disk
swapping, which is generally called the memory bottleneck problem or
thrashing. In fact, the TOP-Q algorithm, an extension of Ukkonen's
algorithm, took seven hours to construct the suffix tree for genome
sequences of 40Mbp, which is much smaller than the human genome, and it
could not finish for genome sequences of 60Mbp.

^1 bp stands for 'base pair.' There are four bases, namely adenine (A),
cytosine (C), guanine (G), and thymine (T).

To cope with the memory bottleneck problem, a few disk-based algorithms
have been proposed for constructing the suffix tree. Disks are much
larger than main memory at a lower cost; however, their access time is
longer by up to several hundred times. Hence, the disk-based algorithms
are designed mainly to maximize main memory utilization and disk access
efficiency. However, these algorithms share a common drawback: they incur
random disk accesses. Disk access performance depends more on the access
pattern than on the amount of data accessed; even for the same amount of
data, random disk access requires much more time than sequential disk
access. Thus, the disk-based algorithms have been improved by decreasing
the ratio of random disk accesses.

Another problem of the previous disk-based algorithms is that they do not
fully utilize the most up-to-date CPU technologies. Instead of raising
the clock speed, recent CPUs are designed to have multiple,
simultaneously running cores that enable intra-CPU parallel processing.
However, some previous algorithms run mostly on a single core, and the
others suffer from severe interference among threads and hence gain
little from parallel processing. We explain the problems of the previous
algorithms in more detail in Section 3.

In this paper, we propose a fast algorithm based on the
'divide-and-conquer' strategy for constructing the suffix tree for
large-scale human genome sequences. The most significant difference from
the previous algorithms is that the proposed algorithm almost eliminates
random disk accesses by accessing the disk in units of contiguous chunks,
each of which stores an entire suffix subtree. In addition, our algorithm
fully utilizes multi-core CPUs by dividing the genome sequences into
multiple, independent partitions and then assigning each partition to a
different core for parallel construction of suffix subtrees. In our
experiments, our algorithm finished constructing the suffix tree for the
entire human genome sequences in 64 minutes and outperformed the DIGEST
algorithm, which had previously been the fastest disk-based algorithm, by
up to 3.5 times.

This paper is organized as follows. In Section 2, we briefly explain the
suffix tree. In Section 3, we explain the previous disk-based suffix tree
construction algorithms, together with the performance degradation caused
by random disk accesses. In Section 4, we propose a new disk-based suffix
tree construction algorithm, and then in Section 5, we evaluate the
performance of our algorithm through a series of experiments.

2. Suffix tree 

Figure 1 shows the suffix tree for a short DNA sequence
X = ATAGCTAGATCG$. The symbol '$' is appended at the end of X so that no
suffix of X is a prefix of any other suffix. Given a query sequence S,
the search begins from the root node of the suffix tree. From the
outbound edges of the root node, an edge e is chosen such that the label
of e is a prefix of S. If no such edge is found, the search ends; if
found, the child node N_e is visited by following the edge e, i.e., e is
the inbound edge of N_e. Let l be the label length of e, p_l(S) be the
prefix of S of length l, and s_l(S) be the suffix of S of length
Len(S) - l. Then, it holds that S = p_l(S) ⊕ s_l(S), where ⊕ is the
sequence concatenation operator. The search for the query subsequence
s_l(S) begins recursively at the node N_e in the same manner as at the
root node. The search goes on until a terminal node is reached in the
suffix tree or there is no query (sub)sequence left to be searched for.

Let us take a query sequence S = AGATCG as an example. In Figure 1(a),
from the outbound edges of the root node, the edge with label 'A' is
followed and then the node N_1 is visited. The search for the query
subsequence s_1(S) = GATCG is performed recursively at the node N_1. The
search continues until the terminal node with position 6 is reached; this
indicates that the query sequence S is found at position 6 in the
sequence X. Figure 1(b) shows the suffix tree whose edge labels are
represented with (start, end) positions in X. While the labels'
representation sizes in Figure 1(a) vary with the label lengths, those in
Figure 1(b) are all identical.

(a) Edge labels are represented with subsequences.

(b) Edge labels are represented with (start, end) positions in X.
Figure 1: Suffix tree for a sequence X = ATAGCTAGATCG$.
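
The search procedure described in this section can be illustrated with a
small sketch. It is not the construction algorithm of this paper: the tree
is built with a naive top-down grouping purely so that there is something
to search, edge labels are stored as half-open (start, end) offsets into X
(an assumption of this sketch), and the names build, leaves, and search
are hypothetical.

```python
def build(X, positions, depth=0):
    """Naive top-down build. An internal node is a dict mapping the first
    label character to (start, end, child); a leaf is the suffix position."""
    node = {}
    groups = {}
    for p in positions:
        groups.setdefault(X[p + depth], []).append(p)
    for ch, group in groups.items():
        if len(group) == 1:
            p = group[0]
            node[ch] = (p + depth, len(X), p)  # terminal edge runs to '$'
            continue
        l = 1  # extend the edge while all suffixes in the group agree
        while len({X[p + depth + l] for p in group}) == 1:
            l += 1
        start = group[0] + depth
        node[ch] = (start, start + l, build(X, group, depth + l))
    return node


def leaves(child):
    """Collect the suffix positions stored below a node."""
    if isinstance(child, int):
        return [child]
    found = []
    for _, _, grandchild in child.values():
        found.extend(leaves(grandchild))
    return found


def search(X, root, S):
    """Follow edges whose labels match S; return all positions of S in X."""
    node, i = root, 0
    while i < len(S):
        if isinstance(node, int) or S[i] not in node:
            return []
        start, end, child = node[S[i]]
        k = min(end - start, len(S) - i)  # compare label and remaining query
        if X[start:start + k] != S[i:i + k]:
            return []
        i += k
        node = child
    return sorted(leaves(node))


X = "ATAGCTAGATCG$"
root = build(X, range(len(X)))
print(search(X, root, "AGATCG"))  # [6]
```

With 0-based positions, the query AGATCG is reported at position 6, which
agrees with the example in the text.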

3. Related work 

Hunt et al. [9] proposed the first disk-based suffix tree construction
algorithm. Hunt's algorithm excludes the construction of suffix links,
which caused the severe memory bottleneck problem in Ukkonen's algorithm.
Hunt's algorithm divides the given genome sequences into partitions and
then constructs a separate suffix subtree for each partition. Although
Hunt's algorithm has O(n^2) complexity, it shows better indexing
performance than Ukkonen's algorithm by reducing disk accesses. However,
Hunt's algorithm incurs heavy random disk accesses since it stores each
node in the suffix tree as a separate object using the persistent Java
object storage interface called PJama. In fact, the algorithm was
successful in indexing genome sequences of up to 286Mbp in size, but it
could not be used for indexing the human genome sequences [9].

Tian et al. presented the Top-Down Disk-based (TDD) approach for
constructing disk-based suffix trees. TDD consists of two algorithms: the
Partition and Write Only Top Down (PWOTD) algorithm, based on the
Wotd-eager algorithm, for constructing suffix trees, and a memory buffer
management algorithm for maximizing the performance of the PWOTD
algorithm. The performance of the PWOTD algorithm depends highly on the
settings of the memory buffer management algorithm [14]. Tian et al.
showed that TDD incurred only one sixth as many disk accesses as the
DynaCluster algorithm [5], an extension of Hunt's algorithm, and that TDD
constructed the suffix tree for the entire human genome sequences in 30
hours. However, the memory buffer management algorithm in TDD assigns
only a small portion of memory for keeping the suffix tree in main
memory, while it assigns the largest portion to the input genome
sequences. TDD uses the Least Recently Used (LRU) policy for swapping out
the memory buffers to disk while constructing the suffix tree. Whenever
the PWOTD algorithm creates a new node N, it needs to access N's parent
node P, which could have been stored far away from N. This causes random
disk accesses, and larger genome sequences cause more random accesses.

Phoophakdee and Zaki [11] proposed an algorithm called TRELLIS, which
eliminated data skewness among suffix subtrees by dividing genome
sequences according to variable-length prefixes. Unlike Hunt's algorithm
[9] and TDD, TRELLIS can optionally create suffix links after the suffix
tree is constructed. TRELLIS consists of three phases: prefix creation,
partitioning, and merging. In the prefix creation phase, variable-length
prefixes are created so that, for each prefix P_j, the suffix subtree T_j
corresponding to the suffixes having the prefix P_j can be loaded into
main memory as a whole. In the partitioning phase, the entire genome
sequences are divided into partitions so that each partition R_i and its
corresponding suffix tree T_i can be loaded into main memory as a whole.
Then, a suffix tree T_i is constructed for each partition in this phase.
In the merging phase, for each prefix P_j created in the prefix creation
phase, the suffix subtrees T_ij are extracted from the suffix trees T_i
and then merged into a single suffix subtree T_j. Phoophakdee and Zaki
[11] showed that TRELLIS outperformed TDD by up to 4 times and that it
constructed the suffix tree for the entire human genome sequences in 4.2
hours. However, since TRELLIS extracts the suffix subtrees T_ij stored at
random positions in the suffix trees T_i in the merging phase, it incurs
severe random disk accesses. In fact, the merging phase requires the
longest execution time.

Ghoting and Makarychev [7] proposed an algorithm called WAVEFRONT, which,
like TRELLIS [11], is based on the 'partition-and-merge' strategy.
WAVEFRONT divides the entire data into I/O-efficient partitions and
processes each partition independently. In [7], WAVEFRONT was extended to
be executed on a massively parallel system. The algorithm completed
indexing the entire human genome sequences in 15 minutes on an IBM Blue
Gene/L system composed of 1024 processors. However, WAVEFRONT executed on
a single processor showed no noticeable performance improvement compared
with TRELLIS.

Figure 2: Disk read/write transfer rates: sequential read/write performed
much better than random read/write.

Barsky et al. proposed an algorithm called DIGEST, which consists of two
phases similar to the merge-sort algorithm. In the first phase, the
entire genome sequence is divided into partitions of the same length so
that each partition can be loaded into main memory. For each partition,
the suffixes contained therein are sorted in main memory and then stored
on disk. In the second phase, the suffixes sorted separately in each
partition are merge-sorted. Suffix blocks from each partition are read
sequentially one by one into main memory. The suffixes in different
blocks are compared with each other, and the smallest one is extracted
and then saved in the output block. When the output block becomes full,
it is stored on disk. This continues until all the input blocks are
empty. The sorted suffixes are called a suffix array, and it is known
that a suffix array can be easily converted into a suffix tree. Barsky et
al. showed that DIGEST outperformed TRELLIS+, an extension of TRELLIS
[11], by up to 40% and that the algorithm completed indexing the entire
human genome sequences in about 85 minutes. However, in the second phase,
DIGEST has to read suffix blocks from partitions stored at random
positions and hence suffers from severe random disk accesses. Moreover,
since the merging phases of TRELLIS and DIGEST cannot be parallelized,
they gain little performance even on recent multi-core CPUs.
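
The two-phase, merge-sort-like structure attributed to DIGEST above can
be sketched in a few lines. This is only an in-memory illustration of the
idea, not the DIGEST implementation: the sorted runs are kept in Python
lists instead of disk blocks, suffixes are compared as whole strings, and
the function name and the part_len parameter are hypothetical.

```python
import heapq


def suffix_array_by_merge(X, part_len):
    """Sort suffixes per equal-length partition, then k-way merge the runs."""
    # Phase 1: sort the suffixes that start inside each partition.
    runs = []
    for lo in range(0, len(X), part_len):
        hi = min(lo + part_len, len(X))
        runs.append(sorted(range(lo, hi), key=lambda i: X[i:]))
    # Phase 2: repeatedly extract the smallest suffix among the run heads.
    return list(heapq.merge(*runs, key=lambda i: X[i:]))


X = "ATAGCTAGATCG$"
print(suffix_array_by_merge(X, part_len=4))
```

In a disk-based setting, each run would live in a different region of the
disk, so the merge in the second phase alternates between regions; this
is the access pattern that the text identifies as random.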

As explained so far, the common drawback of the previous algorithms is
the performance degradation due to random disk accesses. Figure 2 shows
an experimental result of reading/writing a disk volume of 100MB in size.
The volume was read and written sequentially and at random in units of
512KB and 4KB. In the figure, sequential read and write performed up to
112.1 and 47.7 times better than random read and write, respectively. The
values in Figure 2 will differ according to the experimental environment,
though sequential accesses consistently perform better than random
accesses.






4. Proposed indexing algorithm 

In this section, we propose a new algorithm for indexing human genome
sequences. The human genome is composed of 46 chromosomes: 22 chromosome
pairs numbered 1~22 and the X/Y (sex) chromosomes. In this paper, we
concatenate the entire genome sequences into a single long sequence and
use this sequence as the input of our algorithm. This helps simplify the
indexing and searching algorithms.

Our algorithm is designed based on the divide-and-conquer strategy: it
divides the entire human genome sequence into multiple independent
partitions and then constructs the suffix tree separately for each
partition. The suffix tree for each partition is constructed in a
contiguous chunk in main memory. When the construction is completed, the
chunk image is stored sequentially on disk as it is. Hence, unlike
TRELLIS and DIGEST, our algorithm has no performance degradation due to
random disk accesses. Moreover, since the suffix trees for different
partitions are constructed independently and are not merged thereafter,
their construction can be done in parallel by fully utilizing the most
up-to-date multi-core CPUs. Owing to these features, our algorithm
achieves dramatic performance improvement compared with the previous
algorithms.

Our algorithm represents each base as a 2-bit code, as in previous
algorithms; A, C, G, and T are represented as 00, 01, 10, and 11,
respectively. Since the human genome sequence has a size of approximately
3Gbp and four 2-bit codes fit in one byte, the 2-bit coded sequence has a
size of about 3Gbp / 4 = 750MB. In fact, after removing unidentified base
pairs, the 2-bit coded sequence has a size of about 700MB and can be
fully loaded into main memory. Our algorithm allocates a memory region
for the full 2-bit coded genome sequence at the beginning and retains it
to the end.
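
The 2-bit coding can be illustrated with a minimal sketch. The code
values follow the text (A = 00, C = 01, G = 10, T = 11); the order in
which the four codes are placed inside a byte, and the helper names pack
and base_at, are assumptions of this sketch.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq):
    """Pack a string over {A, C, G, T} into bytes, four bases per byte."""
    buf = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Assumption: base i occupies bits 2*(i % 4) .. 2*(i % 4) + 1.
        buf[i >> 2] |= CODE[base] << (2 * (i & 3))
    return bytes(buf)


def base_at(buf, i):
    """Read back the base stored at position i of a packed sequence."""
    return BASES[(buf[i >> 2] >> (2 * (i & 3))) & 0b11]


packed = pack("ATAGCTAGATCG")
print(len(packed))                                   # 3 bytes for 12 bases
print("".join(base_at(packed, i) for i in range(12)))
```

At this density, 3,000,000,000 bases occupy 750,000,000 bytes, which
matches the 750MB estimate above.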

Our algorithm divides the human genome sequence into partitions according
to prefixes, i.e., the suffixes having a common prefix belong to the same
partition. We explain how to determine the prefixes for partitioning at
the end of this section. The partitions are not created by physically
dividing the genome sequence; only the suffix positions are managed for
each partition. While scanning the entire genome sequence, our algorithm
creates the lists of suffix positions simultaneously for every prefix
determined earlier; the list for a prefix P_j (0 ≤ j < m) contains the
positions of the suffixes having the prefix P_j, where m is the number of
partitions. Although each of these lists is small, together the lists
occupy a considerable amount of memory. Hence, the lists are stored on
disk right after their creation; each list is retrieved from disk only
once, when the suffix tree is about to be constructed for the
corresponding partition. Our algorithm creates each list of suffix
positions in a contiguous memory region so that the list can be read or
written with a single operation, which eliminates random disk accesses.
To obtain the sizes of the contiguous memory regions, our algorithm scans
the human genome sequence to count the frequency of every prefix before
creating the lists of suffix positions.
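
The two scans described above (count the prefix frequencies, then fill
contiguous lists of suffix positions) can be sketched as follows. This is
a simplified illustration rather than the paper's implementation: all
lists live in one in-memory array instead of being written to disk,
suffixes shorter than the prefix length are skipped, and the function
name is hypothetical.

```python
from array import array
from itertools import product


def partition_suffixes(X, p):
    """Return {prefix: array of suffix positions} for all prefixes of length p."""
    prefixes = ["".join(t) for t in product("ACGT", repeat=p)]
    index = {pre: j for j, pre in enumerate(prefixes)}
    m = len(prefixes)

    # Scan 1: count how many suffixes start with each prefix.
    counts = [0] * m
    for i in range(len(X) - p + 1):
        j = index.get(X[i:i + p])
        if j is not None:
            counts[j] += 1

    # The counts give the size and offset of each contiguous region.
    starts = [0] * (m + 1)
    for j in range(m):
        starts[j + 1] = starts[j] + counts[j]

    # Scan 2: append every suffix position to the region of its prefix.
    positions = array("I", [0] * starts[m])
    cursor = starts[:m]
    for i in range(len(X) - p + 1):
        j = index.get(X[i:i + p])
        if j is not None:
            positions[cursor[j]] = i
            cursor[j] += 1

    return {pre: positions[starts[j]:starts[j + 1]]
            for j, pre in enumerate(prefixes)}


parts = partition_suffixes("ATAGCTAGATCG$", p=2)
print(parts["AG"].tolist())  # [2, 6]
```

Because each region is contiguous, a real implementation could write or
read one list with a single I/O request, as the text describes.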

Figure 3: Example of adding suffixes into a suffix tree.

When the creation of partitions (i.e., the lists of suffix positions in
the human genome sequence) is completed, our algorithm constructs the
suffix tree separately for each partition. At first, our algorithm
creates an empty suffix tree without any node and then adds suffixes one
by one into the suffix tree while scanning the corresponding list of
suffix positions. Figure 3 shows an example of adding suffixes into a
suffix tree. Figure 3(a) shows a suffix tree before addition. Figure 3(b)
shows the result of adding a suffix S_1 = AGTG$ into the suffix tree in
Figure 3(a). S_1 has the prefix p_2(S_1) = AG of length 2, which matches
the label of the outbound edge of N_1, and then s_2(S_1) = TG$ does not
have a common prefix with any label of the outbound edges of N_2. In this
case, our algorithm creates a new outbound edge e of N_2 and labels it
with s_2(S_1) = TG$. The edge e is connected to a new terminal node p_3,
i.e., e becomes the inbound edge of p_3. Figure 3(c) shows the result of
adding a suffix S_2 = ACTG$ into the suffix tree in Figure 3(a). The
label of the outbound edge of N_1 only partially matches S_2: the match
covers the prefix p_1(S_2) = A. In this case, our algorithm cuts the
outbound edge of N_1 and adds a new internal node N'; the inbound edge of
N' has the label p_1(S_2) = A. A new outbound edge e is added to node N'
and is labeled with s_1(S_2) = CTG$. The edge e is connected to a new
terminal node p_3, i.e., e becomes the inbound edge of p_3.

Each time a suffix is added into the suffix tree, a new terminal node is
created in the tree. Since every suffix ends with the symbol $, a suffix
cannot be a prefix of any other suffix and has a unique position in the
human genome sequence. Hence, a terminal node must exist in the suffix
tree to represent the unique position of each suffix. The terminal node
must have an inbound edge in the tree. The edge is an outbound edge of
either (1) an existing node (the case of Figure 3(b)) or (2) a new node
added between the cut edges (the case of Figure 3(c)). There exist no
other cases.
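
The two cases can be made concrete with a small sketch of suffix
insertion. It uses ordinary Python objects instead of the compact node
layout of the paper, labels are half-open (start, end) offsets into X,
and the class and function names are hypothetical; it is meant only to
show that every insertion ends in one of the two cases above.

```python
class Node:
    def __init__(self, start, end, pos=None):
        self.start, self.end = start, end  # inbound edge label X[start:end]
        self.pos = pos                     # suffix position (terminal only)
        self.children = {}                 # first label character -> child


def add_suffix(X, root, pos):
    """Insert the suffix X[pos:] by walking down from the root."""
    node, i = root, pos
    while True:
        child = node.children.get(X[i])
        if child is None:
            # Case (1): no outbound edge shares a prefix with the rest.
            node.children[X[i]] = Node(i, len(X), pos)
            return
        k = 0  # length of the match between the label and the rest
        while child.start + k < child.end and X[child.start + k] == X[i + k]:
            k += 1
        if child.start + k == child.end:
            node, i = child, i + k         # whole label matched: descend
            continue
        # Case (2): partial match, so cut the edge with a new internal node.
        mid = Node(child.start, child.start + k)
        node.children[X[i]] = mid
        child.start += k
        mid.children[X[child.start]] = child
        mid.children[X[i + k]] = Node(i + k, len(X), pos)
        return


def paths(X, node, prefix=""):
    """Yield (root-to-terminal label, position) for every terminal node."""
    label = prefix + X[node.start:node.end]
    if node.pos is not None:
        yield label, node.pos
    for child in node.children.values():
        yield from paths(X, child, label)


X = "ATAGCTAGATCG$"
root = Node(0, 0)
for pos in range(len(X)):
    add_suffix(X, root, pos)
print(all(label == X[pos:] for label, pos in paths(X, root)))  # True
```

Each call creates exactly one terminal node and at most one internal
node, which is the property the following paragraphs rely on.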

Figure 4 shows the generalization of adding suffixes into the suffix tree
by our algorithm. Let us assume that we have visited the node N_i in the
course of searching for a suffix S in Figure 4(a). The concatenation
L = L_1 ⊕ ... ⊕ L_i of edge labels from the root node to N_i should be
the same as the prefix p_l(S) of length l = Len(L), i.e., L = p_l(S). In
case L_{i+1} ∩ s_l(S) = ∅, i.e., the two share no common prefix, an edge
e labeled with s_l(S) and a new terminal node N_p with the inbound edge e
are added as in Figure 4(b). In case L_{i+1} ∩ s_l(S) = L' (≠ ∅), a new
internal node N'_{i+1} and a new terminal node N_p are added as in
Figure 4(c), where l' = Len(L') and
p_{l'}(L_{i+1}) = p_{l'}(s_l(S)) = L'. Since the suffix always ends with
$, we cannot have the case S = L in Figure 4.

Figure 4: Generalization of adding suffixes into a suffix tree.

Figure 5: Data structure of our algorithm: the information on a node and
its inbound edge is contained together.

Figure 5 shows the data structure of our algorithm. As shown in the
figure, the information on a node and its inbound edge is contained
together in a single data structure. The fields a and b represent the
start and end positions of the inbound edge in the human genome sequence
as shown in Figure 1(b). The field right contains the pointer to the next
sibling node, and foo represents either (1) a pointer to the leftmost
child node in case of an internal node or (2) the suffix position in the
genome sequence in case of a terminal node. The field misc contains
miscellaneous information on the node. The fields a, b, right, and foo
are 4-byte unsigned integers, while the field misc is a 2-byte unsigned
integer. Hence, the data structure has a fixed length of 4 x 4 + 2 = 18
bytes. To distinguish between internal and terminal nodes, the field b is
examined. If b = n, where n is the length of the genome sequence, it is a
terminal node; if b < n, it is an internal node (refer to Figure 1(b)).
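
As a quick check of the record size, the layout can be written down with
Python's struct module. The field order and the little-endian, unpadded
encoding are assumptions of this sketch; the text only fixes the field
widths. The helper names are hypothetical.

```python
import struct

# a, b, right, foo: 4-byte unsigned integers; misc: 2-byte unsigned integer.
# "<" selects little-endian byte order with no padding (an assumption).
NODE = struct.Struct("<IIIIH")


def pack_node(a, b, right, foo, misc=0):
    """Encode one node together with its inbound edge."""
    return NODE.pack(a, b, right, foo, misc)


def is_terminal(record, n):
    """A node is terminal when its edge label ends at the sequence length n."""
    a, b, right, foo, misc = NODE.unpack(record)
    return b == n


print(NODE.size)  # 18
```

With native alignment (no "<" prefix), the same fields could be padded to
20 bytes, so an unpadded layout is needed to obtain the 18 bytes stated
in the text.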

We can efficiently construct the suffix trees using the data structure in
Figure 5. We explain this using Figure 6, which shows the representation
of the suffix trees in Figure 4 using the data structure;
Figures 6(a)~6(c) correspond to Figures 4(a)~4(c), respectively. In
Figure 6(a), the fields (a_i, b_i) and (a_{i+1}, b_{i+1}) represent the
start and end positions of labels L_i and L_{i+1}, respectively. The
fields marked with X stand for "don't care" fields, which are neither
used nor updated here. The arrow indicates a pointer to a possibly
distant node. The nodes N_i and N_{i+1} may not be adjacent as shown in
the figure, though N_{i+1} is easily accessed by following the pointer.
Figure 6(b) shows the case where a new terminal node N_p is added. The
node N_{i+1} can be either an internal or a terminal node and is a
sibling node of N_p. In the figure, the leftmost child node of N_i has
been changed from N_{i+1} to N_p. This is because we can efficiently add
N_p as a new child node of N_i without accessing N_{i+1} and all its
sibling nodes. Figure 6(c) shows the case where a new internal node
N'_{i+1} and a new terminal node N_p are added. The field values of
N_{i+1} are copied to the newly allocated node region, and then the field
a_{i+1} is adjusted (b_{i+1} is not changed). The field values of
N'_{i+1} are set in the region previously used by N_{i+1} as shown in the
figure. The node N_p is a sibling node of N_{i+1} and is added as the
leftmost child node of N'_{i+1} as in Figure 6(b). The key idea we would
like to show in Figure 6 is that, when a suffix is added, the suffix tree
constructed so far is modified only slightly; the addition is done simply
by allocating new memory region(s) for one or two nodes and then setting
a few appropriate field values therein. This is one of the features that
make our algorithm efficient.

Our algorithm constructs a suffix tree in a main memory chunk. Memory
regions for new nodes (and their inbound edges) are allocated
sequentially in the chunk. The pointers in Figures 5 and 6 are relative
offset values from the beginning of the chunk. Once the construction of a
suffix tree is completed, our algorithm stores the chunk image on disk
without any modification. When the chunk image is reloaded into main
memory, the pointers are still valid regardless of where it is reloaded.
Since the chunk image is written to and read from the disk sequentially,
there is no performance degradation due to random disk accesses, and thus
the performance is significantly improved. When multiple suffix trees are
constructed in parallel, our algorithm allocates a separate memory chunk
for each suffix tree. Even in this case, the human genome sequence is
loaded only once into a memory region shared by the simultaneous
processes of our algorithm. This parallel processing enables further
performance improvement.
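
The following sketch illustrates why relative offsets make the chunk
image relocatable. It is an illustration under several assumptions, not
the paper's code: nodes use the 18-byte layout in an assumed field order,
the value 0xFFFFFFFF is a hypothetical "no pointer" sentinel, and the
three nodes are filled in by hand rather than by the construction
algorithm.

```python
import os
import struct
import tempfile

NODE = struct.Struct("<IIIIH")  # a, b, right, foo, misc (assumed order)
NIL = 0xFFFFFFFF                # hypothetical "no pointer" sentinel


def build_demo_chunk():
    """Allocate three nodes sequentially; pointers are offsets in the chunk."""
    chunk = bytearray()

    def alloc(a, b, right, foo):
        offset = len(chunk)     # the next free byte is the node's address
        chunk.extend(NODE.pack(a, b, right, foo, 0))
        return offset

    leaf2 = alloc(3, 13, NIL, 2)     # terminal node for suffix position 2
    leaf0 = alloc(1, 13, leaf2, 0)   # terminal node; right sibling is leaf2
    inner = alloc(0, 1, NIL, leaf0)  # internal node; leftmost child is leaf0
    return bytes(chunk), inner


def save_and_reload(chunk):
    """Write the chunk image with one write, then read it back in one read."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(chunk)
        path = f.name
    try:
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


def children(chunk, offset):
    """Yield child offsets: follow foo once, then the right pointers."""
    child = NODE.unpack_from(chunk, offset)[3]
    while child != NIL:
        yield child
        child = NODE.unpack_from(chunk, child)[2]


chunk, inner = build_demo_chunk()
reloaded = save_and_reload(chunk)
print([NODE.unpack_from(reloaded, c)[3] for c in children(reloaded, inner)])
```

No pointer fix-up is needed after reloading, because every pointer is
interpreted relative to the start of whatever buffer holds the image.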

We now explain how to determine the prefixes for dividing the human
genome sequence into partitions. Each suffix in the genome sequence is
assigned to a partition according to its prefix; every suffix in a
partition has a common prefix. Given a prefix length p, our algorithm
creates a partition for each possible prefix of length p. The number of
partitions is 4^p. A weakness of this scheme is that it causes data
skewness among the partitions [11]; there may be big differences among
the sizes of the partitions and hence of the corresponding suffix trees.
We tackle this weakness as follows. As p increases, the number of
suffixes in each partition decreases, and the size of the corresponding
suffix tree also decreases. We set p to be large enough to make the
suffix tree sizes smaller than the size M of available main memory. Then,
the simultaneous processes of our algorithm choose the partitions so that
the estimated sizes of their corresponding suffix trees sum up very close
to M. This can be done with simple computations. By fully utilizing main
memory in this way, our algorithm achieves better indexing performance.

Figure 6: Data structures corresponding to the suffix trees in Figure 4.
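
The text does not spell out how the partitions are chosen so that their
estimated tree sizes sum close to M. One simple possibility, shown here
purely as an assumption, is a greedy pass that considers the largest
partitions first and takes every partition that still fits; the function
name and the example sizes are hypothetical.

```python
def choose_batch(est_sizes, M):
    """Greedily pick partitions whose estimated tree sizes fit within M.

    est_sizes maps a prefix to the estimated size of its suffix tree.
    Returns the chosen prefixes and the total estimated size used.
    """
    batch, used = [], 0
    ordered = sorted(est_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for prefix, size in ordered:
        if used + size <= M:
            batch.append(prefix)
            used += size
    return batch, used


sizes = {"AA": 60, "AC": 50, "AG": 30, "AT": 20, "CA": 10}
print(choose_batch(sizes, M=100))  # (['AA', 'AG', 'CA'], 100)
```

Repeating this selection on the partitions that remain would yield
successive batches, each filling the available memory as far as the
estimates allow.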

The minimum length of prefixes is computed approximately using the
following Eq. (1):

$$ p_{\min} = \log_4 \frac{f \cdot n}{M} \qquad (1) $$



where n is the length of the human genome sequence and f is a
multiplication factor used to estimate the suffix tree size. M represents
the size of the main memory remaining after loading the entire 2-bit
coded human genome sequence. f is defined as the maximum of T/s, where s
is the length of a genome sequence and T is the size of the corresponding
suffix tree. We estimate the size of a big suffix tree by test
construction of small suffix trees. The value of f differs greatly
according to the suffix tree construction algorithm and is about 30~32 in
our algorithm.
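
As a worked illustration of Eq. (1), the sketch below finds the smallest
integer p for which f * n / 4^p does not exceed M. It assumes that the
suffixes are spread evenly over the 4^p partitions, which the text notes
is not true in general because of data skewness, so the result is only a
lower bound. The function name and the memory sizes in the example are
hypothetical.

```python
def min_prefix_length(n, f, M):
    """Smallest integer p such that f * n / 4**p <= M."""
    p = 0
    while f * n > M * 4 ** p:  # integer arithmetic avoids rounding issues
        p += 1
    return p


n = 3_000_000_000              # about 3Gbp
f = 32                         # upper end of the 30~32 range given above
print(min_prefix_length(n, f, M=7 * 2**30))  # 2, assuming 7GiB remain free
print(min_prefix_length(n, f, M=1 * 2**30))  # 4, assuming 1GiB remains free
```

With 7GiB free, f * n / M is about 12.8, so 4^1 = 4 partitions are too
few and 4^2 = 16 suffice; with 1GiB free the ratio is about 89.4, which
requires 4^4 = 256 partitions.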

5. Performance evaluation 

In this section, we show the superiority of our algorithm through a
series of experiments. We use the same data sets as those used in the
previous work. The first set is a short genome sequence of 110Mbp in size
obtained from 6643 organisms. The second set is the entire human genome
sequence of about 3Gbp in size. These data sets are denoted as VDB and
HG18, respectively.

The hardware platform is a PC equipped with an Intel Core2Quad Q9550
2.83GHz CPU, Samsung DDR3 8GB main memory, and a 500GB 7200rpm hard disk.
The software platforms are Ubuntu 10.10 32bit Linux and Windows 7 64bit
Edition. The first experiment was performed on Ubuntu as in [2], and the
second and third experiments were performed on Windows 7. The latter two
experiments were also performed on Ubuntu, though we had 10~15% better
performance on Windows 7. As C/C++ compilers, we used GNU C++ 4.4.5 on
Ubuntu and Visual C++ 2010 Express Edition on Windows 7.

In the first experiment, we compared the performance of our algorithm
with DIGEST [3], which had been the fastest disk-based suffix tree
construction algorithm. We downloaded the source code of DIGEST from the
author's web site^2. In this experiment, we ran our algorithm and DIGEST
on the VDB data set and compared their elapsed times for constructing the
suffix tree^3. Figure 7 shows the result of the experiment; our algorithm
outperformed DIGEST by up to 3.5 times. We executed only one process of
our algorithm in this experiment. If we had executed multiple parallel
processes of our algorithm, we could have achieved higher performance
improvement.

In the second experiment, we ran our algorithm on both the VDB and HG18
data sets and compared the elapsed time for various numbers of parallel
processes of our algorithm. Figure 8 shows the experimental result. Since
the hardware platform has a four-core CPU, we increased the number of
parallel processes up to four. In fact, we obtained almost no further
performance improvement by running more than four parallel processes on
the same platform. Note that the units of the vertical axes are seconds
and minutes in Figures 8(a) and 8(b), respectively. As shown in the
figures, we obtained a performance improvement of up to 3.0 times by
running four parallel processes compared with a single process. We could
not obtain a four-fold performance improvement mostly due to
inter-process communication and synchronization. Since our algorithm is
designed to minimize the effect of disk accesses, it has high potential
for further performance improvement on more advanced CPUs with more cores
and faster clock speeds.

^2 http://webhome.cs.uvic.ca/~mgbarsky/

^3 We also tried the experiment on the HG18 data set; however, DIGEST
always terminated abnormally with a segmentation fault error. We
discussed this with the author of DIGEST, but we could not resolve the
problem.

Figure 7: Result of first experiment: our algorithm outperformed DIGEST
by up to 3.5 times.

In the third experiment, we measured the elapsed time of our algorithm
for various sizes of genome sequences. We ran four processes on the
genome sequences consisting of the first 2, 5, 8, 11, 15, and 24
chromosomes in the human genome sequence. Figure 9 shows the result. A
regression analysis of the experimental result showed that the elapsed
time is almost linearly correlated with the size of the genome sequences.